Abstract

In hybrid hidden Markov model/artificial neural network
(HMM/ANN) automatic speech recognition (ASR) systems, the
phoneme class conditional probabilities are estimated by first
extracting acoustic features from the speech signal based on
prior knowledge, such as speech perception and/or speech
production knowledge, and then modeling the acoustic features
with an ANN. Recent advances in machine learning techniques,
more specifically in the fields of image processing and text
processing, have shown that such a divide-and-conquer strategy
(i.e., separating the feature extraction and modeling steps) may
not be necessary. Motivated by these studies, in the framework
of convolutional neural networks (CNNs), this paper investigates
a novel approach, where the input to the ANN is the raw speech
signal and the output is phoneme class conditional probability
estimates. On the TIMIT phoneme recognition task, we study
different ANN architectures to show the benefit of CNNs and
compare the proposed approach against the conventional approach,
where the spectral-based MFCC feature is extracted and modeled
by a multilayer perceptron. Our studies show that the proposed
approach can yield comparable or better phoneme recognition
performance when compared to the conventional approach. This
indicates that CNNs can learn features relevant for phoneme
classification automatically from the raw speech signal.
Index Terms: Automatic speech recognition, Artificial neural
networks, Convolutional neural networks, Phonemes, Data-driven
feature extraction

1. Introduction 

Hidden Markov model (HMM) based automatic speech recognition
(ASR) systems, similar to conventional pattern recognition
systems, break the problem into several sub-tasks (feature
extraction, modeling and decision making) and optimize them
independently. For instance, acoustic features such as mel
frequency cepstral coefficients (MFCC), perceptual linear
prediction (PLP) cepstral coefficients and linear prediction
cepstral coefficients are extracted based on prior knowledge
about speech perception and/or speech production. These
features are then usually modeled by either Gaussian mixture
models (GMMs) or artificial neural networks (ANNs) to estimate
the state emission distribution. This step is often referred to
as acoustic modeling. The decision making, i.e. recognition,
step integrates the acoustic model, lexical knowledge and
language model/syntactical constraints (again estimated
independently on text data) to decode the test utterance.

In recent years, in the fields of computer vision [1] and text
processing [2], studies on sequence recognition problems similar
to ASR have shown that such a divide-and-conquer strategy may
not be necessary. More precisely, these studies have shown that
it is possible to build end-to-end systems (fed with raw input
data) by using architectures composed of many layers, where
each layer learns features (i.e. abstract representations) that
are relevant to the problem of interest.

Inspired by these studies, the present paper, as a first
modest step, investigates the estimation of phoneme class
conditional probabilities from the raw speech signal using
convolutional neural networks (CNNs)¹ [4] for phoneme sequence
recognition. In the framework of the hybrid HMM/ANN system, we
compare the proposed approach with the conventional approach
of extracting spectral-based acoustic features and then
modeling them with an ANN. In addition, we also propose
a discriminative decoding algorithm based on a simple
conditional random field (CRF). Experimental studies conducted
on the TIMIT corpus show that (a) the proposed approach can
yield a phoneme recognition system that is similar to or better
than the system based on the conventional approach and (b)
CRF-based decoding yields better performance than conventional
joint-likelihood-based decoding.

The remainder of the paper is organized as follows. Section 
2 presents a brief survey of related literature. Section 3 presents 
the architecture of the proposed system. Section 4 presents the 
experimental setup and Section 5 presents the results. Section 6 
presents an analysis. Section 7 provides a discussion and Sec- 
tion 8 concludes the paper. 

2. Related Work 

Despite the success of spectral-based acoustic features, there
has been interest in modeling the raw speech signal for speech
recognition. In one of the earliest works, Poritz proposed an
approach where the speech signal is modeled by a linear
prediction HMM [5]. This work was later revisited as switching
autoregressive HMM [6], and more recently in the framework of
switching linear dynamical systems [7]. Experiments on isolated
word/digit recognition tasks have shown that these approaches
can yield performance comparable to a standard cepstral-based
HMM system in clean conditions, and better performance under
noisy conditions [7]. In [8], an approach to model the raw
speech signal was proposed. In this approach, the statistical
characteristics of the signal are modeled as the output of a
filter excited by a Gaussian source. The potential of the
approach was demonstrated on classification of
speaker-dependent discrete utterances consisting of 18 highly
confusable stop consonant-vowel syllables. More recently, the
combination of raw speech and cepstral features in the
framework of support vector machines has been investigated for
noisy phoneme classification [9].

¹ In the speech literature, a CNN is referred to as a time-delay
neural network [3].

3. Proposed system 

Compared to classical approaches, convolutional neural net- 
works alleviate the problem of designing/choosing the right fea- 
tures for a particular task of interest. These networks can be fed 
with raw signal, and learn low-level or mid-level features in an
end-to-end manner [10, 11].

The proposed system is composed of two parts: the estima- 
tion of the phoneme class conditional probabilities and the de- 
coding of the sequence. The first part is performed by a CNN, 
which takes the raw speech signal as input. For the second part,
a simple CRF is used to decode the sequence.

3.1. Convolutional Neural Network 

The network is given a window of raw input signal and computes
the conditional probability p(i|x) for each phoneme class
i. One class is then attributed to an example by computing
argmax_i p(i|x). This type of network architecture is composed
of several filter extraction stages, followed by a
classification stage. A filter extraction stage involves a
convolutional layer, followed by a temporal pooling layer and a
non-linearity (tanh()). Our optimal architecture included 3
stages of filter extraction (see Figure 1). The signals coming
out of these filter stages are fed to a classification stage,
which in our case was a one-hidden-layer MLP. The last layer is
a softmax layer, which computes the conditional probability.

Figure 2: Illustration of a convolutional layer. d_in and d_out
are the dimensions of the input and output frames. kW is the
kernel width (here kW = 3) and dW is the shift between two
linear applications (here dW = 2).

Figure 3: Illustration of a max-pooling layer. kW is the number
of frames taken for each max operation and d represents the
dimension of the input/output frames (which are equal).



3.1.1. Convolutional layer

While "classical" linear layers in standard MLPs accept a
fixed-size input vector, a convolutional layer is assumed to be
fed with a sequence of T vectors/frames: X = {x_1, x_2, ..., x_T}.
A convolutional layer applies the same linear transformation over
each successive (or interspaced by dW frames) window of kW
frames. For example, the transformation at frame t is formally
written as:

$$M \begin{pmatrix} x_{t-(kW-1)/2} \\ \vdots \\ x_{t+(kW-1)/2} \end{pmatrix} \tag{1}$$

where M is a d_out × d_in matrix of parameters. In other words,
d_out filters (rows of the matrix M) are applied to the input
sequence. An illustration is provided in Figure 2.

3.1.2. Max-pooling layer

This kind of layer performs local temporal max operations
over an input sequence, as shown in Figure 3. More formally,
the transformation at frame t is written as:

$$\max_{t-(kW-1)/2 \le s \le t+(kW-1)/2} x_s^i \quad \forall i \tag{2}$$

These layers increase the robustness of the network to slight
temporal distortions in the input.

3.1.3. SoftMax layer

The Softmax [12] layer interprets network output scores f_i(x)
as conditional probabilities, for each class label i:

$$p(i|x) = \frac{e^{f_i(x)}}{\sum_j e^{f_j(x)}} \tag{3}$$

3.1.4. Network training

The network parameters θ are learned by maximizing the
log-likelihood L, given by:

$$L(M_1, \ldots, M_L, \theta) = \sum_{n} \log(p(i_n|x_n, \theta)) \tag{4}$$

for each input x and label i, over the whole training set, with
respect to the parameters of each layer M_l. Defining the logadd
operation as logadd_j(z_j) = log(Σ_j e^{z_j}), the likelihood L
can be expressed as:

$$L = \log(p(i|x)) = f_i(x) - \mathrm{logadd}_j(f_j(x)) \tag{5}$$

where f_i(x) describes the network score of input x and class
i. Maximizing this likelihood is performed using the stochastic
gradient ascent algorithm [13].
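
As an illustration of Equations (3) to (5), the following NumPy sketch evaluates the log-probability of a class from a vector of network scores using the logadd operation. The score values and the function names are hypothetical and chosen only for this example; subtracting the maximum before exponentiating is a common numerical safeguard and an assumption of this sketch, not something stated in the text.

```python
import numpy as np

def logadd(z):
    """log(sum_j exp(z_j)), computed stably by factoring out the maximum."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def log_prob(scores, i):
    """log p(i|x) = f_i(x) - logadd_j(f_j(x)), as in Equation (5)."""
    return scores[i] - logadd(scores)

def log_likelihood(score_rows, labels):
    """Sum of log p(i_n|x_n) over a set of examples, as in Equation (4)."""
    return sum(log_prob(f, i) for f, i in zip(score_rows, labels))

# Hypothetical network scores f_j(x) for one example and three classes.
scores = np.array([2.0, 0.5, -1.0])
probs = np.exp([log_prob(scores, i) for i in range(len(scores))])

# Hypothetical two-example "training set" with its labels.
total = log_likelihood([scores, np.array([0.0, 1.0, 0.0])], [0, 1])
```

The exponentiated values in `probs` sum to one, which is the softmax of Equation (3) written in the log domain.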

Figure 1: Convolutional neural network. Several stages of
convolution/pooling/tanh might be considered. Our network
included 3 stages.
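
One filter extraction stage (convolution, max-pooling, tanh) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the kW frames of each window are flattened so the weight matrix has shape (d_out, kW * d_in), a bias term is added, and pooling windows do not overlap. The sizes in the usage lines are hypothetical, with kW = 3 and dW = 2 taken from the illustration in Figure 2.

```python
import numpy as np

def temporal_conv(x, M, b, kW, dW):
    """Apply the same linear map to each window of kW frames, shifted by dW.

    x: (T, d_in) input sequence; M: (d_out, kW * d_in); b: (d_out,).
    Returns an array of shape (T_out, d_out).
    """
    T = x.shape[0]
    n_out = (T - kW) // dW + 1
    out = np.empty((n_out, M.shape[0]))
    for t in range(n_out):
        window = x[t * dW : t * dW + kW].reshape(-1)  # flatten kW frames
        out[t] = M @ window + b
    return out

def temporal_max_pool(x, kW):
    """Max over non-overlapping groups of kW frames, per dimension."""
    n_out = x.shape[0] // kW
    return x[: n_out * kW].reshape(n_out, kW, x.shape[1]).max(axis=1)

def filter_stage(x, M, b, kW, dW, pool_kW):
    """Convolution, then max-pooling, then tanh."""
    return np.tanh(temporal_max_pool(temporal_conv(x, M, b, kW, dW), pool_kW))

# Hypothetical sizes: 100 raw samples (d_in = 1), 4 filters.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
M = rng.normal(size=(4, 3 * 1))
b = np.zeros(4)
y = filter_stage(x, M, b, kW=3, dW=2, pool_kW=3)
```

Stacking several such stages and feeding the flattened result to an MLP with a softmax output gives the overall structure described in Section 3.1.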



3.1.5. Designing and tuning the network 

The number of convolution and pooling layers, as well as the 
size of the kernels kW and the shift dW are all chosen by vali- 
dation. It is worth mentioning that for a given input window size 
over the raw signal, the size of the output of the filter
extraction stage strongly depends on the number of max-pooling
layers, each of them dividing the output size of the filter
stage by the chosen pooling kernel width. As a result, adding
pooling layers reduces the input size of the classification
stage, which in turn reduces the number of parameters of the
network (as most parameters lie in the classification stage).
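
The effect of pooling on the input size of the classification stage can be checked with a short calculation. The sketch below uses the raw-signal window (270 ms at 16 kHz), kernel widths (10, 5, 9) and filter count (90) reported in Section 4.3, but the shift dW = 1 and the pooling width of 3 are hypothetical values chosen for illustration, so the resulting numbers are not those of the reported networks.

```python
def conv_frames(n_in, kW, dW=1):
    """Number of output frames of a convolution over n_in frames."""
    return (n_in - kW) // dW + 1

def stage_frames(n_in, kW, dW=1, pool_kW=1):
    """Frames after convolution followed by non-overlapping pooling."""
    return conv_frames(n_in, kW, dW) // pool_kW

def classifier_inputs(n_samples, kernel_widths, n_filters, pool_kW):
    """Size of the vector fed to the classification stage."""
    n = n_samples
    for kW in kernel_widths:
        n = stage_frames(n, kW, dW=1, pool_kW=pool_kW)
    return n * n_filters

n_samples = 270 * 16            # 270 ms at 16 samples per ms
kernel_widths = (10, 5, 9)

with_pooling = classifier_inputs(n_samples, kernel_widths, 90, pool_kW=3)
without_pooling = classifier_inputs(n_samples, kernel_widths, 90, pool_kW=1)
print(with_pooling, without_pooling)   # 14040 386910
```

Under these assumptions, three pooling layers shrink the classifier input by a factor of about 27, and the first layer of the classification stage shrinks by the same factor.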

3.2. Decoder 

We consider a very simple version of CRFs, where we define 
a graph with nodes for each frame in the input sequence, and 
each label. This CRF allows us to discriminatively train a
simple duration model over our network output scores.
Transition scores are assigned to edges between phonemes, and
network output scores are assigned to nodes. Given an input data
sequence x and a label path y on the graph, a score for the path
can be defined:

$$s(x, y) = \sum_{t=1}^{T} \left( f_{y_t}(x_t) + A_{y_t, y_{t-1}} \right) \tag{6}$$

where A is a matrix describing transitions between labels and
f_{y_t}(x_t) is the network score of input x for class y_t at
time t. Path scores are interpreted as conditional probabilities
by applying a softmax (see Section 3.1.3) over all possible
paths. The CRF transition scores are then trained by maximizing
the likelihood over the training data with gradient ascent.
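
The path score of Equation (6), its normalization over all paths, and the search for the best path can be sketched as follows. This is an illustrative NumPy sketch rather than the decoder used in the experiments: it assumes no transition score is applied at the first frame, uses the standard forward recursion for the softmax normalizer and the standard Viterbi recursion for decoding, and runs on random hypothetical scores.

```python
import numpy as np

def path_score(f, A, y):
    """Equation (6) with no transition term at the first frame (assumption).

    f: (T, K) network scores; A[j, k]: score of moving from label k to j.
    """
    s = f[0, y[0]]
    for t in range(1, len(y)):
        s += f[t, y[t]] + A[y[t], y[t - 1]]
    return s

def log_partition(f, A):
    """logadd of the scores of all K**T paths, via the forward recursion."""
    alpha = f[0].copy()
    for t in range(1, f.shape[0]):
        scores = alpha[None, :] + A              # scores[j, k]
        m = scores.max(axis=1)
        alpha = f[t] + m + np.log(np.exp(scores - m[:, None]).sum(axis=1))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def viterbi(f, A):
    """Label path with the highest score."""
    T, K = f.shape
    delta = f[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[None, :] + A              # scores[j, k]
        back[t] = scores.argmax(axis=1)
        delta = f[t] + scores.max(axis=1)
    y = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

# Hypothetical scores: 5 frames, 3 labels.
rng = np.random.default_rng(0)
f = rng.normal(size=(5, 3))
A = rng.normal(size=(3, 3))
best = viterbi(f, A)
log_p_best = path_score(f, A, best) - log_partition(f, A)
```

Here `log_p_best` is the log of the conditional probability of the best path, that is, its score minus the logadd over all paths. Training the transition matrix A by gradient ascent would maximize the same quantity evaluated on the reference label paths.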

4. Experimental Setup 

In this section we present the setup used for the experiments, as 
well as the different features and the decoding algorithms. 

4.1. TIMIT Corpus 

The TIMIT acoustic-phonetic corpus consists of 3,696 training 
utterances (sampled at 16 kHz) from 462 speakers, excluding
the SA sentences. The cross-validation set consists of 400
utterances from 50 speakers. The core test set was used to
report the results. It contains 192 utterances from 24 speakers,
excluding the validation set. The 61 hand-labeled phonetic
symbols are mapped to 39 phonemes with an additional garbage
class, as presented in [14].

4.2. Features 

4.2.1. Raw 

Features are simply composed of a window of the speech signal
(hence d_in = 1 for the first convolutional layer, as shown in
Figure 1). The window is normalized such that it has zero mean
and unit variance. Using raw data allows us to learn filters
with minimal priors.
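
The per-window normalization can be written in a few lines. The sketch below is an assumption about how such a normalization might be done in NumPy; the small constant `eps`, which guards against division by zero on silent windows, is not mentioned in the text, and the input window is random placeholder data.

```python
import numpy as np

def normalize_window(window, eps=1e-8):
    """Shift and scale a window of samples to zero mean and unit variance."""
    window = np.asarray(window, dtype=np.float64)
    return (window - window.mean()) / (window.std() + eps)

# Hypothetical 270 ms window at 16 kHz (4320 samples).
rng = np.random.default_rng(0)
raw = 1000.0 * rng.normal(size=270 * 16) + 12.0
normalized = normalize_window(raw)
```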

4.2.2. MFCC 

We also performed several experiments with MFCCs as input
features. They were computed (with HTK [15]) using a 25 ms
Hamming window on the speech signal, with a shift of 10 ms.
The signal is represented using 13th-order coefficients along
with their first and second derivatives, computed on a 9-frame
context (d_in = 39 for the first convolutional layer).

4.3. Network hyper-parameters 

The hyper-parameters of the network were hand-tuned using 
a cross-validation set. The ranges that were considered are
reported in Table 1.



Table 1: Network hyper-parameters

Parameter                                             Range
Input window size (ms)                                100-700
Kernel width (kW)                                     1-9
Number of filters per kernel (d_out)                  10-90
Number of hidden units in the classification stage    100-1500


• Input window size: this parameter corresponds to the
context taken along with each example. In the raw feature
experiment, it was set to 270 ms. For the MFCC
experiment, 30 frames were taken as context.

• In the raw case, the kernel widths of the first, second and
third convolutional layers were set to 10, 5 and 9,
respectively. For the MFCC experiments, they were set to 39,
5 and 7, respectively.

• Number of filters: all convolutions had 90 filters for the
raw experiments, and 80 for the MFCC experiments.

• The number of hidden units was set to 500.

• The MFCC-based networks had no pooling layer. We
found that pooling operations decreased performance
with these features, while they are crucial for raw
signal input experiments (see Section 6.1). This is not
surprising, as MFCCs are sufficiently engineered to work
well with simple network classifiers.

The experiments were implemented using the Torch7 toolbox
[16]. As a comparison, an MLP architecture was also tested.
It is composed of two layers. The hidden layer width was set to
500 units.
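
The size of such a two-layer MLP follows directly from its layer widths. The sketch below counts weights and biases, assuming 40 output classes (39 phonemes plus the garbage class) and one bias per unit. The input sizes used here are inferred rather than stated in the text: an input of 9 × 39 = 351 values and an input of 1440 raw samples (90 ms at 16 kHz) happen to reproduce the two MLP parameter counts listed in Table 2.

```python
def mlp_param_count(d_in, n_hidden, n_classes):
    """Weights and biases of a one-hidden-layer MLP."""
    hidden = d_in * n_hidden + n_hidden
    output = n_hidden * n_classes + n_classes
    return hidden + output

# Inferred input sizes (assumptions, see the paragraph above).
mfcc_params = mlp_param_count(9 * 39, 500, 40)
raw_params = mlp_param_count(1440, 500, 40)
print(mfcc_params, raw_params)   # 196040 740540
```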



[Figure 4 panels: five first-layer filters (one labeled Filter 58); x-axis: Frequency [kHz].]

Figure 4: Frequency responses of filters learned in the first convolutional layer.



4.4. Decoding 

We used the simple CRF approach described in Section 3.2 as 
the decoding algorithm. We also report experimental results with
a standard HMM decoder, with a constrained duration of 3 states,
and considering all phonemes equally probable.

5. Results 

We evaluate the capacity of the network to estimate conditional
probabilities with a phoneme sequence recognition experiment
on the TIMIT database. The results are presented in Table
2, in terms of phoneme accuracy for the different features and
decoding schemes, along with the number of parameters. Using
raw speech, the CNN architecture outperforms the baseline,
and the CRF approach increases the accuracy compared to the
HMM approach. Using MFCC features with the CNN architecture
yields performance similar to that of the raw features. The
baseline accuracy is consistent with other works, although a
bit lower, most likely due to the absence of supplementary
processing, such as the speaker-level mean and variance
normalisation in [17].



Table 2: Phoneme recognition accuracy on the core test set of
TIMIT corpus.

Features    Arch.    Decoding    Nbr. param.    Test acc. (%)
MFCC        MLP      HMM         196'040        66.65
Raw         MLP      HMM         740'540        38.91
Raw         CNN      HMM         720'110        67.88
Raw         CNN      CRF         720'110        69.47
MFCC        CNN      HMM         860'700        70.52
MFCC        CNN      CRF         860'700        71.80

6. Analysis

6.1. Advantage of max-pooling layers

We varied the number of pooling layers to evaluate their
contribution to the overall performance of the architecture. The
other hyper-parameters were tuned such that the same input
window size was kept for each architecture. The output dimension
of each convolution was also tuned for each case (to reduce
overfitting due to too large a number of parameters). The
phoneme accuracy of each architecture is reported in Table 3,
using raw features and HMM decoding, along with the number of
parameters of the network. Clearly, adding max-pooling layers
improves the system performance while providing an easy way to
reduce the number of parameters (see Section 3.1.5).

Table 3: Max-pooling (MP) layers contribution

Number of MP layers    Network parameters    Test accuracy (%)
3                      303'460               67.60
2                      380'660               67.18
1                      507'860               67.14
0                      593'460               64.96



6.2. Filters trained in the first layer 

Figure 4 presents the response of five randomly chosen filters.
Clearly, each filter responds to different frequencies of the input 
signal. In our future work, we will investigate the relationship 
between the filters learned and the task at hand. 

7. Discussion 

Using the CNN architecture with raw speech data shows a large
improvement compared to the classical MLP system, which
suggests that this architecture can indeed learn features.
Moreover, it outperforms the baseline, with almost no
pre-processing of the data. These results suggest that deep
architectures can learn efficient features and, more
importantly, that it is possible to achieve performance similar
to that of complex hand-crafted features, which calls their use
into question.

When adding a decoder, the CRF approach seems to work
better than the generative HMM approach, even though the CRF
has no duration constraints, in contrast to the three-state
duration constraint applied to the HMM. The CRF is still
optimized independently, but end-to-end training is possible
with this framework, and might lead to better performance.

8. Conclusions 

In this paper, we proposed to use convolutional neural networks 
to estimate phoneme class probabilities. Our system is able to 
learn features by taking raw speech data as input and
outperforms baseline systems. Moreover, using MFCC features as
input yields comparable performance.

For future work, we plan to evaluate the robustness of our 
architecture with studies in noisy conditions. Secondly, as this 
work was intended as a first step for an end-to-end trained sys- 
tem, we plan to develop such a system applying the Graph 
Transformer Networks [18] approach, integrating the decoding 
step in our network. From there, we will focus on developing 
more specific applications, such as Spoken Term Detection. 